Home Credit Default Risk (HCDR)

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Some of the challenges

  1. Dataset size
    • 688 MB compressed, with millions of rows of data
    • 2.71 GB uncompressed

Kaggle API setup

Kaggle is a data science competition platform that shares many datasets. In the past, submitting your results was troublesome: you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to set up; it took me less than 15 minutes to complete a submission.

  1. Install the library

For more detailed information on setting the Kaggle API see here and here.

Dataset and how to download

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.

Home Credit Group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19 May 2018).


Data files overview

There are 7 different sources of data:

image.png

Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the following Data Webpage and unzip the zip file to the BASE_DIR
  2. If you plan to use the Kaggle API, please use the following steps.

Imports

Data files overview

Data Dictionary

As part of the data download comes a Data Dictionary. It is named HomeCredit_columns_description.csv.

image.png

Application train

Application test

The application dataset has the most information about the client: Gender, income, family status, education ...

The Other datasets

Exploratory Data Analysis Phase -1

Descriptive statistics

● A data dictionary of the raw features.
● Pandas profiling in jupyter notebook.
● We performed descriptive analysis on the dataset: the data type of each feature, the dataset size (307,511 rows × 122 columns), and summary statistics (number of observations, mean, standard deviation, minimum, maximum, and quartiles) for all features. The data split is Train: 70%, Test: 20%, Validation: 10%.
● We generated charts on descriptive statistics of the target dataset.
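The summary statistics above can be reproduced with pandas; a minimal sketch on a toy frame (illustrative columns only, not the full HCDR schema):

```python
import pandas as pd

# Toy stand-in for application_train (illustrative columns only)
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [202500.0, 270000.0, 67500.0, 135000.0],
    "CNT_CHILDREN": [0, 0, 0, 1],
    "CODE_GENDER": ["M", "F", "M", "F"],
})

print(df.shape)        # (rows, columns)
print(df.dtypes)       # data type of each feature
stats = df.describe()  # count, mean, std, min, quartiles, max per numeric column
print(stats.loc["mean", "AMT_INCOME_TOTAL"])
```

On the real application_train this reports the (307511, 122) shape and the per-feature summary statistics described above.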

Summary of Application train

Missing data for application train

Distribution of the target column

Correlation with the target column

Applicants Age

Applicants occupations

Target Vs borrowers based on gender

Males are more likely to default than females, based on the percentage of defaulter counts (second graph)

Gender Vs Income based on Target

Own House count based Target

Not a significant difference, but borrowers who own a house are slightly more likely to repay

Own car count based Target

Borrowers owning a car are more likely to pay on time

Occupation type count based on Target

Occupation type vs income based on Target

The defaulter percentage is lower when the IC ratio is either low or high

Repayers to Applicants Ratio

Correlation of the positive days since birth and target

Correlation of the positive days since employment and target

Fetching important relevant features

Pandas profiling (contains correlation graphs between features)

Dataset questions

Unique record for each SK_ID_CURR

previous applications for the submission file

The persons in the Kaggle submission file have previous applications in previous_application.csv: 47,800 out of 48,744 people have had previous applications.

Histogram of Number of previous applications for an ID

Can we differentiate applicants by low, medium, and high numbers of previous applications?
* Low = fewer than 5 previous applications (22%)
* Medium = 5 to 39 previous applications (58%)
* High = 40 or more previous applications (20%)
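This bucketing can be sketched with `pd.cut`; the bin edges below follow the low/medium/high split above (the exact edges are an assumption, and the counts are toy data):

```python
import pandas as pd

# Toy counts of previous applications per SK_ID_CURR
prev_counts = pd.Series([1, 3, 7, 12, 38, 40, 55])

# left-inclusive bins: [0, 5) = low, [5, 40) = medium, [40, inf) = high
buckets = pd.cut(prev_counts,
                 bins=[0, 5, 40, float("inf")],
                 labels=["low", "medium", "high"],
                 right=False)
print(buckets.value_counts().to_dict())
```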

Joining secondary tables with the primary table

In the case of the HCDR competition (and many other machine learning problems that involve multiple tables in 3NF or not) we need to join these datasets (denormalize) when using a machine learning pipeline. Joining the secondary tables with the primary table will lead to lots of new features about each loan application; these features will tend to be aggregate type features or meta data about the loan or its application. How can we do this when using Machine Learning Pipelines?

Joining previous_application with application_x

We refer to the application_train data (and application_test data) as the primary table and the other files as secondary tables (e.g., the previous_application dataset). The secondary tables can be joined to the primary table using the key SK_ID_CURR (SK_ID_PREV identifies individual previous applications).

Let's assume we wish to generate features based on previous application attempts. Possible features here could be, for example, the number of previous applications per client, or aggregates such as the mean or maximum amount requested across previous applications.

To build such features, we need to join the application_train data (and the application_test data) with the 'previous_application' dataset (and the other available datasets).

When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:

  1. Preprocess each of the non-application datasets, generating many new (derived) features, and then join (aka merge) the results with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) before processing the data (in train, validation, and test partitions) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?]

I want you to think about this section and build on this.

Roadmap for secondary table processing

  1. Transform all the secondary tables into features that can be joined into the main application table (labeled and unlabeled)
    • 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
    • 'previous_application', 'POS_CASH_balance'
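The roadmap above can be sketched with a pandas groupby/agg followed by a left merge; a minimal toy example (column names are illustrative, and the derived feature names `prev_app_count`/`prev_app_mean` are invented for this sketch):

```python
import pandas as pd

# Primary table: one row per current application
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3],
                    "AMT_CREDIT": [100_000, 50_000, 75_000]})

# Secondary table: previous applications, many rows per SK_ID_CURR
prev = pd.DataFrame({"SK_ID_CURR": [1, 1, 2],
                     "AMT_APPLICATION": [20_000, 30_000, 10_000]})

# Aggregate the secondary table down to one row per SK_ID_CURR
prev_feats = (prev.groupby("SK_ID_CURR")["AMT_APPLICATION"]
                  .agg(prev_app_count="count", prev_app_mean="mean")
                  .reset_index())

# Left-join so applications with no history (SK_ID_CURR=3) are kept (with NaNs)
merged = app.merge(prev_feats, on="SK_ID_CURR", how="left")
print(merged)
```

The left join matters: clients with no rows in the secondary table must stay in the primary table, with their derived features left as NaN for the imputer to handle.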

agg detour

Aggregate using one or more operations over the specified axis.

For more details see the pandas documentation for agg:

DataFrame.agg(func, axis=0, *args, **kwargs)
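A quick sketch of agg in action on toy data, showing a single operation, several operations at once, and different operations per column:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

# One operation over axis=0 (applied to each column)
print(df.agg("sum"))

# Several operations at once: returns a DataFrame indexed by operation
res = df.agg(["min", "max"])
print(res)

# Different operations per column, via a dict
per_col = df.agg({"A": "sum", "B": "mean"})
print(per_col)
```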

Multiple condition expressions in Pandas

So far, our boolean selections have involved a single condition. You can, of course, have as many conditions as you would like. To do so, you need to combine your boolean expressions using logical operators.

Although Python uses the syntax and, or, and not, these will not work when testing multiple conditions with pandas. The details of why are explained here.

You must use the following operators with pandas: & for and, | for or, and ~ for not.
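A quick illustration on a toy frame (column names borrowed from the HCDR schema, data invented). Note that each condition must be parenthesized, because &, |, and ~ bind more tightly than comparison operators:

```python
import pandas as pd

df = pd.DataFrame({"AMT_INCOME_TOTAL": [100_000, 250_000, 50_000],
                   "CODE_GENDER": ["M", "F", "F"]})

# & combines conditions; each one needs its own parentheses
mask = (df["AMT_INCOME_TOTAL"] > 80_000) & (df["CODE_GENDER"] == "F")
print(df[mask])

# ~ negates a boolean Series
not_male = df[~(df["CODE_GENDER"] == "M")]
print(len(not_male))
```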

Missing values in prevApps

feature engineering for prevApp table

feature transformer for prevApp table

Join the labeled dataset

Join the unlabeled dataset (i.e., the submission file)

Processing pipeline

OHE when there are previously unseen unique values in the test/validation set

Train, validation, and test sets (and the leakage problem we mentioned previously):

Let's look at a small use case to see how to deal with this:

This last problem can be solved by using the option handle_unknown='ignore' of the OneHotEncoder, which, as the name suggests, will ignore previously unseen values when transforming the test set.

Here is an example of that in action:

# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 
               'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']

# Notice handle_unknown="ignore" in the OHE, which ignores values from the
# validation/test sets that do NOT occur in the training set
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])

OHE case study: The breast cancer wisconsin dataset (classification)

Please see this blog for more details on OHE when the validation/test sets have previously unseen unique values.

HCDR preprocessing

Our Baseline Model

To get a baseline, we will use some of the features after they have been preprocessed through the pipeline. The baseline model is a logistic regression model.

Building Logistic Regression baseline pipeline

Loss function used (data loss and regularization parts) in LaTeX
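A standard form of the regularized logistic loss (binary cross-entropy data term plus an L2 penalty) — the generic textbook form, not necessarily the exact variant in our notebook — is:

```latex
J(\mathbf{w}) = -\frac{1}{m}\sum_{i=1}^{m}\Big[ y_i \log \sigma(\mathbf{w}^\top \mathbf{x}_i)
              + (1 - y_i)\log\big(1 - \sigma(\mathbf{w}^\top \mathbf{x}_i)\big)\Big]
              + \frac{\lambda}{2m}\lVert \mathbf{w} \rVert_2^2
```

Here the first sum is the data loss over the $m$ training examples, $\sigma$ is the logistic function, and the final term is the regularization part controlled by $\lambda$.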

Submission 1

Improving the AUC

Submission 2

Approach 2

Submission 3

Phase-2

Feature Engineering

Additional EDA

Discarding features with more than 30% null values

Filling null values in the NAME_TYPE_SUITE column with "Other_C"

Filling null values in columns containing the keyword AMT_REQ_CREDIT with 0

Filling null values in columns containing the keyword CNT_SOCIAL_CIRCLE with 0

Filling null values in the column CNT_FAM_MEMBERS with the median

Filling null values in the column AMT_GOODS_PRICE with the median for the respective category

Dropping the single row where DAYS_LAST_PHONE_CHANGE is null

Dropping 12 rows where AMT_ANNUITY is null

Checking the features most highly correlated with the External Source columns, to replace their null values

Checking the null values in the Train dataset

Training and testing with the selected columns

Adding additional features felt to be relevant

Hyperparameter tuning (Grid Search)

Decision Tree

Lasso Regression

Ridge Regression

Logistic Regression

Kaggle submission via the command line API

Screenshot of kaggle submission

Screen%20Shot%202022-04-19%20at%2010.35.28%20PM.png

Report submission

Click on this link

Write-up (Phase -2 )

Abstract

HomeCredit uses machine learning models to provide unsecured loans based on a user's credit history, repayment patterns, and other data. Credit history is a metric that captures a user's credibility based on factors such as the user's average/minimum/maximum balance, recorded Bureau scores, salary, and repayment patterns. As part of this project, we use the Kaggle datasets to undertake exploratory data analysis, develop machine learning pipelines, and evaluate models across several evaluation metrics for a model to be deployed. In phase 2 we present feature engineering, hyperparameter tuning, and modeling pipelines. We trained Logistic, Decision Tree, Lasso, and Ridge models on both the baseline inputs and the chosen features. The baseline pipeline has the highest test accuracy at 92%, followed by Logistic Regression at 91.98%, then the Decision Tree, with Lasso and Ridge being the least accurate.

Project Description

Data Description

  1. Train dataset in application_train.csv

Screen%20Shot%202022-04-19%20at%207.49.10%20PM.png

Screen%20Shot%202022-04-19%20at%207.51.11%20PM.png

Workflow

Screen%20Shot%202022-04-12%20at%209.43.34%20PM.png

Feature Engineering and transformers

Screen%20Shot%202022-04-19%20at%207.51.37%20PM.png

Step 1: We discarded columns having more than 30% null values. The table above shows the count of NA values in the remaining columns and their percentages.

Step 2: AMT_REQ_CREDIT_BUREAU_HOUR, AMT_REQ_CREDIT_BUREAU_DAY, AMT_REQ_CREDIT_BUREAU_WEEK, AMT_REQ_CREDIT_BUREAU_MON, AMT_REQ_CREDIT_BUREAU_QRT, and AMT_REQ_CREDIT_BUREAU_YEAR give the number of enquiries made. As the number is not available, we can assume no enquiries were made, so we replace NA values with 0.

Step 3: OBS_30_CNT_SOCIAL_CIRCLE, DEF_30_CNT_SOCIAL_CIRCLE, OBS_60_CNT_SOCIAL_CIRCLE, and DEF_60_CNT_SOCIAL_CIRCLE give us the number of immediate connections who have a loan with Home Credit. Since we don't have this data, we can assume there are no immediate connections, so we replace NA values with 0.

Step 4: CNT_FAM_MEMBERS NA values are filled with the median.

Step 5: AMT_GOODS_PRICE values depend on the NAME_FAMILY_STATUS categories, so we replaced NA values with the median w.r.t. NAME_FAMILY_STATUS. The figure below shows that AMT_GOODS_PRICE depends on NAME_FAMILY_STATUS.

Screen%20Shot%202022-04-19%20at%207.51.56%20PM.png

Step 6: Dropped the single row where DAYS_LAST_PHONE_CHANGE is null.

Step 7: To replace EXT_SOURCE_2 NA values, we found the top 5 most highly correlated variables; REGION_RATING_CLIENT is the most highly correlated with EXT_SOURCE_2. As REGION_RATING_CLIENT is categorical, we fill NA values with the median based on its categories.

Step 8: To replace EXT_SOURCE_3 NA values, we found the top 5 most highly correlated variables; DAYS_BIRTH is the most highly correlated with EXT_SOURCE_3. As DAYS_BIRTH is numerical, we fill NA values using linear regression.
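Steps 5 and 7 above (filling NAs with the median of the matching category) can be sketched with pandas groupby/transform; toy data below, with column names from the HCDR schema:

```python
import pandas as pd

df = pd.DataFrame({
    "NAME_FAMILY_STATUS": ["Married", "Married", "Single", "Single"],
    "AMT_GOODS_PRICE": [500_000.0, None, 200_000.0, None],
})

# Fill each NA with the median of its own NAME_FAMILY_STATUS group
df["AMT_GOODS_PRICE"] = (df.groupby("NAME_FAMILY_STATUS")["AMT_GOODS_PRICE"]
                           .transform(lambda s: s.fillna(s.median())))
print(df)
```

`transform` keeps the original row order and length, so the filled column drops straight back into the frame.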

Provide additional features added to training data

Impact of these new features added to the model

Screen%20Shot%202022-04-19%20at%207.52.22%20PM.png

Why we chose the method and approach

Hyperparameter tuning

Decision Tree

Screen%20Shot%202022-04-19%20at%2010.39.36%20PM.png

Lasso Regression

Screen%20Shot%202022-04-19%20at%207.58.59%20PM.png

Ridge Regression

Screen%20Shot%202022-04-19%20at%207.59.40%20PM.png

Logistic Regression

Screen%20Shot%202022-04-19%20at%208.00.18%20PM.png

Pipelines

The goal here is to predict whether a customer who has approached Home Credit for a loan will default or not. This is a supervised classification task; the target variable is either 1 or 0, where 1 indicates a client with payment difficulties (a defaulter) and 0 indicates that the loan was repaid on time.

Logistic Regression

Logistic Regression can be used as a baseline model along with feature selection techniques like RFE, PCA, and SelectKBest.

In statistics, the (binary) logistic model (or logit model) is a statistical model that models the probability of one event (out of two alternatives) taking place by having the log-odds (the logarithm of the odds) for the event be a linear combination of one or more independent variables ("predictors"). In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (the coefficients in the linear combination). Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names.

loss function:

LRCOSTF.png

Decision Trees:

A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.

In decision analysis, a decision tree and the closely related influence diagram are used as a visual and analytical decision support tool, where the expected values (or expected utility) of competing alternatives are calculated.

loss function:

DecisionTreeCF.jpeg

Lasso Regression:

LASSO is a penalized regression approach that estimates the regression coefficients by maximizing the log-likelihood function (or minimizing the sum of squared residuals) with the constraint that the sum of the absolute values of the regression coefficients, $\sum_{j=1}^{k} |\beta_j|$, is less than or equal to a positive constant s. One interesting property of LASSO is that the estimates of the regression coefficients are sparse, meaning that many components are exactly 0. That is, LASSO automatically deletes unnecessary covariates.

loss function:

Screen%20Shot%202022-04-19%20at%208.28.50%20PM.png
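In LaTeX, the constrained form of the LASSO objective described above can be written as (a standard textbook form):

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\beta_j\Big)^2
  \quad \text{subject to} \quad \sum_{j=1}^{k} |\beta_j| \le s
```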

Regularization of Logistic Regression:

Ridge Regression:

Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. This method performs L2 regularization. When multicollinearity occurs, least-squares estimates are unbiased but their variances are large, so predicted values can be far from the actual values. In ridge regression, the first step is to standardize the variables (both dependent and independent) by subtracting their means and dividing by their standard deviations. This creates a notational challenge, since we must somehow indicate whether the variables in a particular formula are standardized or not. All ridge regression calculations are based on standardized variables; when the final regression coefficients are displayed, they are adjusted back to their original scale. However, the ridge trace is on a standardized scale.

loss function:

Screen%20Shot%202022-04-19%20at%208.29.11%20PM.png

Machine Learning Pipeline Steps:

  1. Data Preprocessing
    a. Gather Kaggle's raw data.
    b. Perform exploratory data analysis on the dataset.
    c. Feature engineering for improving performance of machine learning model.
  2. Model Selection
    a. Develop and test various candidate models, such as Logistic Regression, Decision Trees, Random Forest, and SVMs.
    b. Based on the evaluation measures, select the best model.
    c. Use various evaluation metrics like "accuracy," "F1 Score," and "AUC."
  3. Prediction Generation
    a. Prepare the new data and extract the features as before.
    b. Once the winning model has been chosen, use it to make predictions on the new data.
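The evaluation metrics in step 2c can be sketched in pure Python (with AUC computed as the probability that a randomly chosen positive outranks a randomly chosen negative); a minimal illustration on toy labels, not our project code:

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred):
    # Harmonic mean of precision and recall, via TP/FP/FN counts
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn)

def auc(y_true, scores):
    # Rank statistic: P(score of random positive > score of random negative)
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(accuracy(y_true, y_pred), f1(y_true, y_pred), auc(y_true, scores))
```

In practice we use the sklearn equivalents (accuracy_score, f1_score, roc_auc_score); the sketch just makes the definitions concrete.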

Results and discussion of results

Logistic Regression

Decision Tree

Screen%20Shot%202022-04-19%20at%208.01.03%20PM.png

Lasso Regression

Screen%20Shot%202022-04-19%20at%208.01.46%20PM.png

Ridge Regression

Screen%20Shot%202022-04-19%20at%208.02.47%20PM.png

Logistic Regression GridSearch

Screen%20Shot%202022-04-19%20at%208.03.38%20PM.png

Conclusion

The objective of the HCDR project is to predict the repayment ability of the financially under-served population. This project is important because well-grounded predictions matter to both the lender and the borrower. In real time, Home Credit is able to display loan offers to its customers with the maximum amount and APR using ML pipelines, in which fetching data from data providers via APIs, performing EDA, and fitting the model to generate scores happens in microseconds. Risk analysis is therefore critical: the expected NPA (Non-Performing Assets) must be less than 5% in order to run a profitable business.

Credit history is a measure of a user's credibility, generated from parameters such as the average/min/max balance maintained by the user, Bureau scores reported, salary, etc.; repayment patterns can be analysed using the timely defaults/repayments made by the user in the past. Alternative data includes other parameters such as geographic data, social media data, calling/SMS data, etc. As part of this project we use the datasets provided by Kaggle to perform exploratory data analysis, build machine learning pipelines, and evaluate the models across several evaluation metrics for a model to be deployed.

In phase 2, we estimated several models, including both classification and regression models. We did feature selection, data imputation, and hyperparameter tuning. First, we performed feature selection and imputation, filling in the missing values of the selected features. Then, we added relevant features based on our prior knowledge. Next, we tuned the hyperparameters with the help of GridSearchCV. To find the best model, we trained and evaluated several models: Logistic Regression, a Decision Tree, Lasso Regression, and Ridge Regression. In phase 2, the classification models could not beat the baseline model; among the regression models, ridge regression shows the best performance.

In phase 3, we plan to implement a deep learning model and build additional models in PyTorch. The problem we currently face is that, even with the feature selection and imputation steps, we cannot improve on the test accuracy or AUC of the baseline model; unlike the regression models, we cannot develop a classification model better than the baseline. To address this, in phase 3 we plan to build a multilayer model in PyTorch for loan default classification. As a stretch goal, we will develop and implement a new multitask loss function in PyTorch. These will be submitted to Kaggle and we will report our scores.

Kaggle Submission

Screen%20Shot%202022-04-19%20at%2010.35.28%20PM.png

References

Some of the material in this notebook has been adapted from here

https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8

https://online.stat.psu.edu/stat857/node/216/

TODO: Predicting Loan Repayment with Automated Feature Engineering in Featuretools

Read the following: